singular value decompositions \(\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T\)
iterative methods for numerical linear algebra
Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).
Our major goals (or learning objectives) are to
know the complexity (flop count) of each task
be familiar with the BLAS and LAPACK functions (what they do)
avoid re-inventing the wheel by implementing these dense linear algebra subroutines yourself
understand the need for iterative methods
apply appropriate numerical algebra tools to various statistical problems
All high-level languages (Julia, Matlab, Python, R) call BLAS and LAPACK for numerical linear algebra.
Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See documentation.
2 BLAS
BLAS stands for Basic Linear Algebra Subprograms.
See netlib for a complete list of standardized BLAS functions.
Matlab uses Intel’s MKL (Math Kernel Library). The MKL implementation is the gold standard on the market. It is not open source, but the compiled library is free for Linux and macOS. Not surprisingly, it is tuned for Intel CPUs.
Julia uses OpenBLAS, the best cross-platform, open-source implementation. With the MKL.jl package, it is also very easy to use MKL in Julia.
Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z).
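As a small sketch of this naming convention: Julia's thin wrappers in the standard `LinearAlgebra.BLAS` module dispatch to the appropriate precision variant based on element type, so the same call reaches `sdot`, `ddot`, or `zdotc` under the hood.

```julia
using LinearAlgebra

# The same logical operation reaches different BLAS routines by element type:
xs, ys = ones(Float32, 4), ones(Float32, 4)
xd, yd = ones(Float64, 4), ones(Float64, 4)
xz, yz = ones(ComplexF64, 4), ones(ComplexF64, 4)

BLAS.dot(xs, ys)    # single precision (S) -> sdot
BLAS.dot(xd, yd)    # double precision (D) -> ddot
BLAS.dotc(xz, yz)   # double complex (Z), conjugated -> zdotc
```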
3 Examples
The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.
Some operations appear as level-3 but indeed are level-2.
Example 1. A common operation in statistics is column scaling or row scaling \[
\begin{eqnarray*}
\mathbf{A} &=& \mathbf{A} \mathbf{D} \quad \text{(column scaling)} \\
\mathbf{A} &=& \mathbf{D} \mathbf{A} \quad \text{(row scaling)},
\end{eqnarray*}
\] where \(\mathbf{D}\) is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form \[
\mathbf{X}^T \mathbf{W} \mathbf{X},
\] where \(\mathbf{W}\) is a diagonal matrix with observation weights on diagonal.
Column and row scalings are essentially level-2 operations!
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(257) # seed
n = 2000
A = rand(n, n)  # n-by-n matrix
d = rand(n)     # n vector
D = Diagonal(d) # diagonal matrix with d as diagonal
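To make the level-2 vs level-3 distinction concrete, here is a minimal sketch (repeating the setup for self-containment): multiplying by the structured `Diagonal` type scales columns in \(O(n^2)\), whereas densifying \(\mathbf{D}\) first forces a full \(O(n^3)\) matrix-matrix multiply.

```julia
using LinearAlgebra, Random

Random.seed!(257)
n = 2000
A = rand(n, n)
d = rand(n)
D = Diagonal(d)

# level-3: densify D, then a full O(n^3) matrix-matrix multiply
B_slow = A * Matrix(D)
# level-2: Diagonal dispatch scales the j-th column of A by d[j] in O(n^2)
B_fast = A * D
B_slow ≈ B_fast   # same result, very different cost
```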
Example 2. The inner product between two matrices \(\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}\) is often written as \[
\text{trace}(\mathbf{A}^T \mathbf{B}), \text{trace}(\mathbf{B} \mathbf{A}^T), \text{trace}(\mathbf{A} \mathbf{B}^T), \text{ or } \text{trace}(\mathbf{B}^T \mathbf{A}).
\] These appear to be level-3 operations (matrix multiplication with \(O(m^2n)\) or \(O(mn^2)\) flops).
Random.seed!(123)
n = 2000
A, B = randn(n, n), randn(n, n)

# slow way to evaluate tr(A'B): 2mn^2 flops
@benchmark tr(transpose($A) * $B)
BenchmarkTools.Trial: 95 samples with 1 evaluation per sample.
Range (min … max): 47.436 ms … 79.546 ms ┊ GC (min … max): 0.00% … 1.05%
Time (median): 52.123 ms ┊ GC (median): 1.60%
Time (mean ± σ): 52.616 ms ± 3.310 ms ┊ GC (mean ± σ): 1.51% ± 0.94%
▅▇██▂█▅▂
▃▁▃▁▁▁▁▁▃▁▁▅▆▃▆████████▃▆▃▃▃▃▅▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃ ▁
47.4 ms Histogram: frequency by time 62.1 ms <
Memory estimate: 30.53 MiB, allocs estimate: 3.
But \(\text{trace}(\mathbf{A}^T \mathbf{B}) = \langle \text{vec}(\mathbf{A}), \text{vec}(\mathbf{B}) \rangle\). The latter is a level-1 BLAS operation with \(O(mn)\) flops.
# smarter way to evaluate tr(A'B): 2mn flops
@benchmark dot($A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 414.166 μs … 14.183 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 434.375 μs ┊ GC (median): 0.00%
Time (mean ± σ): 454.625 μs ± 242.236 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▇█▇▇▃▁
▂███████▆▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
414 μs Histogram: frequency by time 644 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Example 3. Similarly, \(\text{diag}(\mathbf{A}^T \mathbf{B})\) can be computed in \(O(mn)\) flops.
# slow way to evaluate diag(A'B): O(n^3)
@benchmark diag(transpose($A) * $B)
BenchmarkTools.Trial: 95 samples with 1 evaluation per sample.
Range (min … max): 47.640 ms … 71.521 ms ┊ GC (min … max): 0.00% … 1.28%
Time (median): 51.493 ms ┊ GC (median): 1.68%
Time (mean ± σ): 52.748 ms ± 3.421 ms ┊ GC (mean ± σ): 1.52% ± 0.69%
▄█▃
▃▃▁▁▁▁▁▄▄▃▄███▇▆█▃▄▆▄▅▁▃▃▃▃▃▁▄▃▁▁▃▃▄▁▁▃▃▁▁▁▁▁▁▁▃▁▁▄▁▁▁▁▁▁▁▃ ▁
47.6 ms Histogram: frequency by time 63.9 ms <
Memory estimate: 30.55 MiB, allocs estimate: 6.
# smarter way to evaluate diag(A'B): O(n^2)
@benchmark Diagonal(vec(sum($A .* $B, dims = 1)))
BenchmarkTools.Trial: 1195 samples with 1 evaluation per sample.
Range (min … max): 1.607 ms … 6.003 ms ┊ GC (min … max): 0.00% … 37.05%
Time (median): 4.195 ms ┊ GC (median): 17.03%
Time (mean ± σ): 4.177 ms ± 499.270 μs ┊ GC (mean ± σ): 15.16% ± 8.50%
▂▄█▃▁▃▃▃▃▃▂
▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▅▅▆▆▇▆▇▅▅▄▅▆▇████████████▆▅▆▅▅▄▃▃▂ ▄
1.61 ms Histogram: frequency by time 5.24 ms <
Memory estimate: 30.55 MiB, allocs estimate: 7.
To avoid allocating intermediate arrays altogether, we can write the double loop ourselves or use the dot function.
function diag_matmul!(d, A, B)
    m, n = size(A)
    @assert size(B) == (m, n) "A and B should have same size"
    fill!(d, 0)
    for j in 1:n, i in 1:m
        d[j] += A[i, j] * B[i, j]
    end
    Diagonal(d)
end

d = zeros(eltype(A), size(A, 2))
@benchmark diag_matmul!($d, $A, $B)
BenchmarkTools.Trial: 1482 samples with 1 evaluation per sample.
Range (min … max): 3.339 ms … 3.507 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.350 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.370 ms ± 38.298 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇ ▂
██▆██▅▅▃▃▃▂▃▂▁▂▁▂▂▁▁▂▂▂▂▂▁▂▂▂▁▁▂▂▂▂▂▁▂▂▁▂▂▂▃▁▂▂▂▂▂▂▂▂▁▁▁▁▁ ▂
3.34 ms Histogram: frequency by time 3.47 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Exercise: Try @turbo (SIMD) and @tturbo (multi-threaded SIMD) from the LoopVectorization.jl package.
4 Memory hierarchy and level-3 fraction
Key to high performance is effective use of memory hierarchy. True on all architectures.
Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.
In Julia, we can query the CPU topology by the Hwloc.jl package. For example, this laptop runs an Apple M2 Max chip with 4 efficiency cores and 8 performance cores.
For example, Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a max memory bandwidth of 32GB/s.
Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?
Answer: use high-level BLAS as much as possible.
A distinction between LAPACK and LINPACK (older versions of R used LINPACK) is that LAPACK makes use of higher-level BLAS as much as possible (usually by smart partitioning) to increase the so-called level-3 fraction.
To appreciate the effort that goes into an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the Quora question, especially the video. The bottom line is:
Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.
5 Effect of data layout
Data layout in memory affects algorithmic efficiency too. It is much faster to move chunks of data in memory than retrieving/writing scattered data.
Storage mode: column-major (Fortran, Matlab, R, Julia) vs row-major (C/C++).
A cache line is the smallest unit of data that can be transferred between cache and memory.
x86 CPUs: 64 bytes
ARM CPUs: 32 bytes
In Julia, we can query the cache line size by Hwloc.jl.
# Apple Silicon (M1/M2 chips) don't have L3 cache
Hwloc.cachelinesize()
ErrorException: Your system doesn't seem to have an L3 cache.
Accessing column-major stored matrix by rows (\(ij\) looping) causes lots of cache misses.
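A minimal sketch of this effect: summing the same column-major matrix with the inner loop over rows (unit stride) versus over columns (stride equal to the number of rows). The two functions return the same value, but the row-wise traversal incurs far more cache misses.

```julia
# cache-friendly: inner index i varies fastest, so memory access is unit stride
function colsum(A)
    s = 0.0
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    s
end

# cache-unfriendly: inner index j varies fastest, so stride is size(A, 1)
function rowsum(A)
    s = 0.0
    @inbounds for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    s
end

A = rand(2000, 2000)
colsum(A) ≈ rowsum(A)   # same answer; colsum is typically much faster
```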
Take matrix multiplication as an example \[
\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}.
\] Assume column-major storage, as in Julia. There are 6 variants of the algorithm according to the ordering of the triple loop.
jki or kji looping:
# inner-most loop
for i in 1:m
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end
ikj or kij looping:
# inner-most loop
for j in 1:n
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end
ijk or jik looping:
# inner-most loop
for k in 1:p
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end
We pay attention to the innermost loop, where the vector calculation occurs. The associated stride when accessing the three matrices in memory (assuming column-major storage) is
| Variant | A Stride | B Stride | C Stride |
|--------------------|----------|----------|----------|
| \(jki\) or \(kji\) | Unit | 0 | Unit |
| \(ikj\) or \(kij\) | 0 | Non-Unit | Non-Unit |
| \(ijk\) or \(jik\) | Non-Unit | Unit | 0 |
Evidently, the variants \(jki\) and \(kji\) are preferred.
""" matmul_by_loop!(A, B, C, order)Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop."""functionmatmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String) m =size(A, 1) p =size(A, 2) n =size(B, 2)fill!(C, 0)if order =="jki"@inboundsfor j =1:n, k =1:p, i =1:m C[i, j] += A[i, k] * B[k, j]endendif order =="kji"@inboundsfor k =1:p, j =1:n, i =1:m C[i, j] += A[i, k] * B[k, j]endendif order =="ikj"@inboundsfor i =1:m, k =1:p, j =1:n C[i, j] += A[i, k] * B[k, j]endendif order =="kij"@inboundsfor k =1:p, i =1:m, j =1:n C[i, j] += A[i, k] * B[k, j]endendif order =="ijk"@inboundsfor i =1:m, j =1:n, k =1:p C[i, j] += A[i, k] * B[k, j]endendif order =="jik"@inboundsfor j =1:n, i =1:m, k =1:p C[i, j] += A[i, k] * B[k, j]endendendusingRandomRandom.seed!(123)m, p, n =2000, 100, 2000A =rand(m, p)B =rand(p, n)C =zeros(m, n);
BenchmarkTools.Trial: 86 samples with 1 evaluation per sample.
Range (min … max): 57.407 ms … 82.670 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 57.860 ms ┊ GC (median): 0.00%
Time (mean ± σ): 58.422 ms ± 2.735 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▇█▃
▆▇█▅▇███▃▆▅▇▆▆▇▅▇▁▁▁▁▃▁▃▅▁▃▁▁▁▁▁▃▁▁▁▁▃▁▃▁▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
57.4 ms Histogram: frequency by time 61.1 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark matmul_by_loop!($A, $B, $C, "kji")
BenchmarkTools.Trial: 26 samples with 1 evaluation per sample.
Range (min … max): 187.500 ms … 225.509 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 197.176 ms ┊ GC (median): 0.00%
Time (mean ± σ): 198.485 ms ± 7.839 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▃ ▃ ▃ █▃ █ ▃
▇▁▁▁█▇█▁▇▁▇▁█▇▁█▁▁▇█▁▁█▇█▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
188 ms Histogram: frequency by time 226 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
\(ikj\) and \(kij\) looping:
@benchmark matmul_by_loop!($A, $B, $C, "ikj")
BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
Range (min … max): 513.649 ms … 517.241 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 514.513 ms ┊ GC (median): 0.00%
Time (mean ± σ): 514.746 ms ± 1.078 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁ ▁ █ ▁ █ ▁
█▁▁▁▁█▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
514 ms Histogram: frequency by time 517 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark matmul_by_loop!($A, $B, $C, "kij")
BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
Range (min … max): 511.840 ms … 522.688 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 517.032 ms ┊ GC (median): 0.00%
Time (mean ± σ): 516.743 ms ± 3.278 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █ █ ██ █ █ █ █
█▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁██▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
512 ms Histogram: frequency by time 523 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
\(ijk\) and \(jik\) looping:
@benchmark matmul_by_loop!($A, $B, $C, "ijk")
BenchmarkTools.Trial: 21 samples with 1 evaluation per sample.
Range (min … max): 237.517 ms … 247.732 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 242.527 ms ┊ GC (median): 0.00%
Time (mean ± σ): 242.632 ms ± 3.128 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ █ ▁ ▁▁ ▁ ▁ ▁ ▁▁ ▁ ▁ █ ▁ ▁ ▁ ▁▁ ▁
█▁▁▁▁▁▁█▁█▁▁██▁█▁▁▁▁▁█▁▁▁█▁█▁█▁▁▁█▁▁█▁█▁▁▁▁█▁▁▁▁▁▁▁█▁█▁██▁▁▁█ ▁
238 ms Histogram: frequency by time 248 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark matmul_by_loop!($A, $B, $C, "jik")
BenchmarkTools.Trial: 21 samples with 1 evaluation per sample.
Range (min … max): 237.851 ms … 249.396 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 241.580 ms ┊ GC (median): 0.00%
Time (mean ± σ): 242.233 ms ± 3.356 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▃ ▃█
▇▁▁▁█▇▇▁▁▁▇▁▁▇▁▇▁▁▇██▁▁▁▇▁▁▁▁▁▁▁▁▇▁▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▇▇▁▁▁▁▁▇ ▁
238 ms Histogram: frequency by time 249 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Question: Can our loop beat BLAS? Julia wraps the BLAS library for matrix multiplication. We see that BLAS wins hands down (multi-threading, SIMD, and blocking into outer products for a higher level-3 fraction).
@benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 1742 samples with 1 evaluation per sample.
Range (min … max): 2.549 ms … 24.514 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.721 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.867 ms ± 675.298 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▅▇▇█▇▅▅▆▅▄▄▃▃▂▂▂▂▁▁▁ ▁▁ ▁ ▁
████████████████████████████▇██▇▇▇▇▇▆▇▇▅▇▅▅▅▅▅▄▄▄▅▅▄▅▁▁▅▅▄▅ █
2.55 ms Histogram: log(frequency) by time 4.23 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
# direct call of the BLAS wrapper function
@benchmark LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C)
BenchmarkTools.Trial: 1748 samples with 1 evaluation per sample.
Range (min … max): 2.546 ms … 9.545 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.705 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.858 ms ± 530.714 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▇█▇▆▅▅▄▄▂▂▁▁▁ ▁
██████████████▇██▇▇█▅▆▆▇▆▆▅▆▆▅▇▇▅▅▄▅▅▄▅▁▄▁▄▁▅▅▁▄▁▁▅▄▄▄▁▄▁▄▄ █
2.55 ms Histogram: log(frequency) by time 5.33 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Question (again): Can our loop beat BLAS?
Exercise: Annotate the loop in matmul_by_loop! by @turbo and @tturbo (multi-threading) and benchmark again.
6 BLAS in R
Tip for R users: the standard R distribution from CRAN uses a very outdated BLAS/LAPACK library.
using RCall

R"""
sessionInfo()
"""
RObject{VecSxp}
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] C
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.4.2
# A tibble: 1 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 A %*% B 127ms 128ms 7.49 30.5MB 7.49 4 4
total_time result memory time
<bch:tm> <list> <list> <list>
1 534ms <dbl [2,000 x 2,000]> <Rprofmem [1 x 3]> <bench_tm [4]>
gc
<list>
1 <tibble [4 x 3]>
┌ Warning: RCall.jl: Warning: Some expressions had a GC in every iteration; so filtering is disabled.
└ @ RCall /Users/huazhou/.julia/packages/RCall/0ggIQ/src/io.jl:172
Rebuilding R from source with OpenBLAS or MKL immediately boosts linear algebra performance in R. Google "build R using MKL" to get started. Similarly, we can build Julia with MKL.
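On the Julia side, no rebuild is needed: a hedged sketch of checking which BLAS backend is loaded (the `BLAS.get_config()` query requires Julia ≥ 1.7, which routes BLAS calls through libblastrampoline).

```julia
using LinearAlgebra

# Inspect the BLAS backend Julia is currently using
BLAS.get_config()

# With the MKL.jl package installed, `using MKL` swaps the backend to MKL
# at runtime; rerunning BLAS.get_config() should then list libmkl_rt.
# using MKL
# BLAS.get_config()
```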
Matlab uses MKL. It is usually very hard to beat Matlab in terms of linear algebra.
using MATLAB

mat"""
f = @() $A * $B;
timeit(f)
"""
ArgumentError: ArgumentError: Package MATLAB not found in current path.
- Run `import Pkg; Pkg.add("MATLAB")` to install the MATLAB package.
7 Avoid memory allocation: some examples
7.1 Transposing a matrix is an expensive memory operation
In R, the command
t(A) %*% x
will first transpose A and then perform the matrix multiplication, causing unnecessary memory allocation.
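By contrast, Julia avoids this copy: `transpose(A)` is a lazy wrapper, and multiplying it by a vector dispatches to BLAS `gemv` with `trans = 'T'`. A minimal sketch:

```julia
using LinearAlgebra

A = rand(1000, 1000)
x = rand(1000)

# transpose(A) makes no copy of A; the product goes straight to gemv('T', ...)
y1 = transpose(A) * x

# equivalent explicit BLAS call
y2 = BLAS.gemv('T', A, x)

y1 ≈ y2   # same result, no transposed copy of A ever materialized
```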